The MULTIVOC Text-To-Speech System

Authors

Olivier M. Emorine

Pierre M. Martin

Abstract

In this paper we introduce MULTIVOC, a real-world text-to-speech product geared to the French language. Starting from ordinary French text, MULTIVOC generates high-quality speech in real time using a diphone-based synthesis method. The processing is divided into three main transformations (phonetization, automatic prosody and rhythm marking, and generation of LPC frames). This paper provides a full description of MULTIVOC, covering not only the technical view but also some real-world applications of the product.

1. PRESENTATION OF MULTIVOC

The MULTIVOC text-to-speech system is the result of a technology transfer from a research institute (CNET Lannion, France), which developed the basis of the system, to an industrial company (Cap Sogeti Innovation, France), which turned the system into a commercial product. By generating Linear Prediction Coding (LPC) frames from ordinary text written in French, MULTIVOC aims to give any standard application the ability to produce low-cost, high-quality speech output in real time.

MULTIVOC is shipped as a complete software system that acts as a sophisticated driver, to which applications can directly send French text to be spoken. The software package consists of the kernel of the driver itself and a set of dictionaries used by it. Several tools in the package allow an advanced user to tailor their own MULTIVOC driver to a specific usage.

Besides this static configuration facility, MULTIVOC also provides several run-time features. By submitting specific requests, an application can change the following parameters:

• The sampling frequency for generated frames. Three different frequencies are available: 8 kHz, 10 kHz and 16 kHz. This parameter determines the quality of the output voice, with 16 kHz providing the best results.
• The tone of the output voice, which can be adjusted in the range 50-350 Hz.
• The speech speed, which may be set from 1 to 10 syllables per second.
• The style of prosody. Two styles are provided: the "reading-style" corresponds to the usual way of reading a text, while the "advertising-style" is dedicated to short commercial messages like jingles.
• The voice, which can be either female or male.

The synthesis method produces Linear Prediction Coding (LPC) frames generated from a diphone dictionary. Such a dictionary is specific to the sampling frequency used (8, 10 or 16 kHz) and also to the voice (female or male). MULTIVOC therefore provides six different diphone dictionaries, one for each combination of the three frequencies and the two voices.

The overall processing is organized as a pipelined set of transformations applied to the input text. At the highest level, one can distinguish the following functions:

• The pre-processing (or lexical processing) is a text-to-text transformation that expands non-worded terms such as numbers (1987 --> "Mille Neuf Cent Quatre-Vingt-Sept"), administrative numbers (A4/B5 --> "A Quatre B Cinq") or acronyms (CSINN. --> "Cap Sogeti Innovation").
• The phonetization process transforms the pre-processed text into phonemes according to predefined rules stored in a user-modifiable base.
• The prosody marking process scans the phonetized text and generates appropriate marks to reflect the prosody of the text, using built-in rules based on the punctuation signs and the grammatical type of words.
• The rhythm marking process computes the duration associated with each phoneme.
• Last, the frame generation process produces the LPC frames that correspond to the input text according to the specified parameters; these frames can be sent directly to the output device.
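As an illustration of the pre-processing step described above, the following is a minimal sketch of how numbers and acronyms might be expanded into words before phonetization. It is not MULTIVOC's actual implementation: the function names (`number_to_french`, `expand_text`), the `ACRONYMS` table, and the restriction to integers from 0 to 9999 are all assumptions made for this example. The spelling rules are the standard French ones (for instance "quatre-vingts" takes an "s" only when nothing follows it). Administrative codes such as A4/B5 are deliberately left untouched here.

```python
import re

# Words for 0..16; 17..19 are built as "dix-" + unit.
UNITS = ["zéro", "un", "deux", "trois", "quatre", "cinq", "six", "sept",
         "huit", "neuf", "dix", "onze", "douze", "treize", "quatorze",
         "quinze", "seize"]
TENS = {2: "vingt", 3: "trente", 4: "quarante", 5: "cinquante", 6: "soixante"}

# Hypothetical acronym lexicon; its single entry is the example from the text.
ACRONYMS = {"CSINN.": "Cap Sogeti Innovation"}


def below_100(n: int) -> str:
    """Spell out 0..99 using standard French rules."""
    if n < 17:
        return UNITS[n]
    if n < 20:
        return "dix-" + UNITS[n - 10]
    if n < 70:
        tens, unit = divmod(n, 10)
        if unit == 0:
            return TENS[tens]
        if unit == 1:
            return TENS[tens] + " et un"      # 21, 31, ..., 61
        return TENS[tens] + "-" + UNITS[unit]
    if n < 80:                                # 70..79 = 60 + 10..19
        if n == 71:
            return "soixante et onze"
        return "soixante-" + below_100(n - 60)
    if n == 80:                               # plural "s" only when alone
        return "quatre-vingts"
    return "quatre-vingt-" + below_100(n - 80)


def below_1000(n: int) -> str:
    """Spell out 0..999."""
    hundreds, rest = divmod(n, 100)
    if hundreds == 0:
        return below_100(rest)
    head = "cent" if hundreds == 1 else UNITS[hundreds] + " cent"
    if rest == 0:
        # "cent" is pluralized only for exact multiples: "deux cents".
        return head + ("s" if hundreds > 1 else "")
    return head + " " + below_100(rest)


def number_to_french(n: int) -> str:
    """Spell out 0..9999 (range limited to keep the sketch short)."""
    if not 0 <= n <= 9999:
        raise ValueError("this sketch only handles 0..9999")
    thousands, rest = divmod(n, 1000)
    if thousands == 0:
        return below_1000(rest)
    head = "mille" if thousands == 1 else UNITS[thousands] + " mille"
    return head if rest == 0 else head + " " + below_1000(rest)


def expand_text(text: str) -> str:
    """Expand known acronyms, then standalone numbers up to 9999."""
    for acronym, expansion in ACRONYMS.items():
        text = text.replace(acronym, expansion)

    def replace_number(match: re.Match) -> str:
        value = int(match.group())
        return number_to_french(value) if value <= 9999 else match.group()

    # \b keeps digits glued to letters (e.g. "A4") out of this rule.
    return re.sub(r"\b\d+\b", replace_number, text)
```

Calling `number_to_french(1987)` yields the words of the paper's example in lower case, "mille neuf cent quatre-vingt-sept". A real pre-processor would need further rules for administrative numbers, larger values, decimals and dates, which this sketch does not attempt.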

Related resources

Cipher text only attack on speech time scrambling systems using correction of audio spectrogram

Recently permutation multimedia ciphers were broken in a chosen-plaintext scenario. That attack models a very resourceful adversary which may not always be the case. To show insecurity of these ciphers, we present a cipher-text only attack on speech permutation ciphers. We show inherent redundancies of speech can pave the path for a successful cipher-text only attack. To that end, regularities ...

Spoken Term Detection for Persian News of Islamic Republic of Iran Broadcasting

Islamic Republic of Iran Broadcasting (IRIB), as one of the biggest broadcasting organizations, produces thousands of hours of media content daily. Accordingly, IRIB's archive is one of the richest archives in Iran, containing a huge amount of multimedia data. Monitoring this massive volume of data, and browsing and retrieving from this archive, is one of the key issues for this broadcasting...

Study on Unit-Selection and Statistical Parametric Speech Synthesis Techniques

One of the interesting topics in the multimedia domain concerns enabling computers to produce speech. Speech synthesis grants computers the human ability to produce speech. The data-based approach and the process-based approach are the two main approaches to speech synthesis. Each approach has its own challenges. Unit-selection speech synthesis and statistical parametr...

L2 Learners’ Lexical Inferencing: Perceptual Learning Style Preferences, Strategy Use, Density of Text, and Parts of Speech as Possible Predictors

This study was intended first to categorize the L2 learners in terms of their learning style preferences and second to investigate if their learning preferences are related to lexical inferencing. Moreover, strategies used for lexical inferencing and text related issues of text density and parts of speech were studied to determine their moderating effects and the best predictors of lexical infe...

Off-line Arabic Handwritten Recognition Using a Novel Hybrid HMM-DNN Model

To facilitate the entry of data into the computer and its digitization, automatic recognition of printed texts and manuscripts is a considerable aid to many applications. Research on automatic document recognition started decades ago with the recognition of isolated digits and letters, and today, due to advancements in machine learning methods, efforts are being made to iden...

Publication date: 1988
